Abstract

This paper addresses the problem of specifying and parsing the 
syntax of domain-specific languages (DSLs) in a modular, user- 
friendly way. That is, we want to enable the design of composable 
DSLs that combine the natural syntax of external DSLs with the 
easy implementation of internal DSLs. The challenge in parsing 
composable DSLs is that the composition of several (individually 
unambiguous) languages is likely to contain ambiguities. In this 
paper, we present the design of a system that uses a type-oriented 
variant of island parsing to efficiently parse the syntax of compos- 
able DSLs. In particular, we show how type-oriented island parsing 
is constant time with respect to the number of DSLs imported. We 
also show how to use our tool to implement DSLs on top of a host 
language such as Typed Racket. 

1. Introduction 

Domain-specific languages (DSLs) provide high productivity for 
programmers in many domains, such as computer systems, physics, 
linear algebra, and other sciences. However, a series of trade-offs 
face the prospective DSL designer today. On the one hand, external 
DSLs offer natural syntax and friendly diagnostics at the cost of 
interoperability issues (Beazley 1996) and difficulty of implementation.
They are usually implemented either by hand or with
parser generators à la YACC that require technical knowledge of
parsing algorithms. Meanwhile, many general-purpose languages 
include a host of tricks for implementing internal (or embedded) 
DSLs, such as templates in C++, macros in Scheme, and type 
classes in Haskell; however, the resulting DSLs are often leaky ab- 
stractions: the syntax is not quite right, compilation errors expose 
the internals of the DSL, and debuggers are not aware of the DSL. 

In this paper, we make progress towards combining the best of 
both worlds into what we call composable DSLs. We want to enable 
fine-grained mixing of languages with the natural syntax of external 
DSLs and the interoperability of internal DSLs. 

At the core of this effort is a parsing problem: although the 
grammar for each DSL may be unambiguous, programs that use 
multiple DSLs, such as the one in Figure 1, need to be parsed using
the union of their grammars, which is likely to contain ambiguities
(Kats et al. 2010). Instead of relying on the grammar author
to resolve them (as in the LALR tradition), the parser for such an 
application must be able to efficiently deal with ambiguities. 
Figure 1. Our common case: an application using many DSLs.



We should emphasize that our goal is to create a parsing system 
that provides much more syntactic flexibility than is currently of- 
fered through operator overloading in languages such as C++ and 
Haskell. We are not trying to build a general purpose parser, that is, 
we are willing to place restrictions on the allowable grammars, so 
long as those restrictions are easy to understand (for our users) and 
do not interfere with composability. 

As a concrete, motivating example, we consider the union of 
grammars for matrix algebra, regular expressions, and sets outlined 
in Figure 2. Written in the traditional style, the union of these
individually unambiguous grammars is highly ambiguous, so importing
many DSLs such as these can increase the parse time by orders
of magnitude even though the program is otherwise unchanged. Of 
course, an experienced computer scientist will immediately say that 
the separate grammars should be merged into one grammar with 
only one production for each operator. However, that would require 
coordination between the DSL authors and is therefore not scalable. 

1.1 Type-Oriented Grammars 

To address the problem of parsing composed DSLs, we observe that 
different DSLs define different types: Matrix, Vector, and Scalar in 
Matrix Algebra, Regexp in Regular Expressions, Set in Sets, and so 
on. We suggest an alternate style of grammar organization that we 
call type-oriented grammars, inspired by Sandberg (1982). In this
style, a DSL author creates one nonterminal for each type in the 
DSL and uses the most specific nonterminal/type for each operand 
in a grammar rule. Figure 3 shows the example from Figure 2
rewritten in a type-oriented style, with nonterminals for Matrix, 
Vector, Scalar, Regexp, and Set. 

1.2 Type-based Disambiguation 

While the union of the DSLs in Figure 3 is no longer itself ambiguous,
programs such as A + B + C ... are still highly ambiguous
if the variables A, B, and C can each be parsed as either Matrix,
Regexp, or Set. Many prior systems (Paulson 1994; Bravenboer
et al. 2005) use chart parsing (Kay 1986) or GLR (Tomita 1985)
module MatrixAlgebra {
  Expr ::= Expr "+" Expr [left,1]
         | Expr "-" Expr [left,1]
         | Expr "*" Expr [left,2]
         | "|" Expr "|" | Id; ...
}

module RegularExpressions {
  Expr ::= "'" Char "'" | Expr "+" | Expr "*"
         | Expr "|" Expr [left] | Id; ...
}

module Sets {
  Expr ::= Expr "+" Expr [left,1]
         | Expr "-" Expr [left,2] | Id; ...
}

import MatrixAlgebra, RegularExpressions, Sets;
A + B + C // Ambiguous!

Figure 2. Ambiguity due to the union of DSLs.



module MatrixAlgebra {
  Matrix ::= Matrix "+" Matrix [left,1]
           | Matrix "-" Matrix [left,1]
           | Matrix "*" Matrix [left,2];
  Scalar ::= "|" Vector "|"; ...
}

module RegularExpressions {
  Regexp ::= "'" Char "'" | Regexp "+"
           | Regexp "*" | Regexp "|" Regexp; ...
}

module Sets {
  Set ::= Set "+" Set [left,1]
        | Set "-" Set [left,2]; ...
}

import MatrixAlgebra, RegularExpressions, Sets;
declare A: Matrix, B: Matrix, C: Matrix {
  A + B + C
}



Figure 3. Type-oriented grammars for DSLs. 



1.3 Contributions 

1. We present the first parsing algorithm, type-oriented island
parsing (Section 3), whose time complexity is constant with respect
to the number of DSLs in use, so long as the nonterminals
of each DSL are largely disjoint (Section 4).

2. We present our extensible parsing system¹ that adds several features
to the parsing algorithm to make it convenient to develop
DSLs on top of a host language such as Typed Racket
(Tobin-Hochstadt and Felleisen 2008) (Section 5).

3. We demonstrate the utility of our parsing system with an example
in which we embed syntax for two DSLs in Typed Racket.

Section 2 introduces the basic definitions and notation used in
the rest of the paper. We discuss our contributions in relation to the
prior literature in Section 6 and conclude in Section 7.

2. Background 

We review the definition of a grammar and parse tree and present 
our framework for comparing parsing algorithms, which is based 
on the parsing schemata of Sikkel (1998).

2.1 Grammars and Parse Trees 

A context-free grammar (CFG) is a 4-tuple $\mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)$
where $\Sigma$ is a finite set of terminals, $\Delta$ is a finite set of nonterminals,
$\mathcal{P}$ is a finite set of grammar rules, and $S$ is the start symbol. We use
$a$, $b$, $c$, and $d$ to range over terminals and $A$, $B$, $C$, and $D$ to range
over nonterminals. The variables $X$, $Y$, $Z$ range over symbols, that
is, terminals and nonterminals, and $\alpha$, $\beta$, $\gamma$, $\delta$ range over sequences
of symbols. Grammar rules have the form $A \to \alpha$. We write
$\mathcal{G} \cup (A \to \alpha)$ as an abbreviation for $(\Sigma, \Delta, \mathcal{P} \cup (A \to \alpha), S)$.
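The definition above can be sketched as a small immutable data structure. This is only an illustration under our own assumptions: the names `Rule`, `Grammar`, and `extend` are hypothetical and do not come from the system described in this paper. The function `extend` plays the role of the abbreviation for adding one rule to a grammar.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Rule:
    lhs: str                # the nonterminal A on the left-hand side
    rhs: Tuple[str, ...]    # the sequence of symbols alpha


@dataclass(frozen=True)
class Grammar:
    terminals: FrozenSet[str]      # Sigma
    nonterminals: FrozenSet[str]   # Delta
    rules: FrozenSet[Rule]         # P
    start: str                     # S


def extend(g: Grammar, rule: Rule) -> Grammar:
    """Return a new grammar with the same Sigma, Delta, and S, and one more rule."""
    return Grammar(g.terminals, g.nonterminals, g.rules | {rule}, g.start)


# Hypothetical example grammar: E ::= V and V ::= "-" V.
g = Grammar(
    terminals=frozenset({"-", "A"}),
    nonterminals=frozenset({"E", "V"}),
    rules=frozenset({Rule("E", ("V",)), Rule("V", ("-", "V"))}),
    start="E",
)
g2 = extend(g, Rule("V", ("A",)))   # add the rule V -> A without touching g
```

Because `extend` returns a new value, the original grammar is unchanged, which is convenient when a rule should be visible only while parsing a nested scope.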

We are ultimately interested in parsing programs, that is, converting
token sequences into abstract syntax trees. So we are less
concerned with the recognition problem and more concerned with
determining the parse trees for a given grammar and token sequence.
The parse trees for a grammar $\mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)$, written
$\mathcal{T}(\mathcal{G})$, are trees built according to the following rules.

1. If $a \in \Sigma$, then $a$ is a parse tree labeled with $a$.

2. If $t_1, \ldots, t_n$ are parse trees labeled $X_1, \ldots, X_n$ respectively,
$A \in \Delta$, and $A \to X_1 \ldots X_n \in \mathcal{P}$, then the following is a
parse tree labeled with $A$.

A 



to produce a parse forest and then type check to filter out the ill- 
typed trees. This solves the ambiguity problem, but these parsers 
are inefficient on ambiguous grammars (Section 4).

This is where our key contribution comes in: island parsing with 
eager, type-based disambiguation. We use a chart parsing strategy, 
called island parsing (Stock et al. 1988) (or bidirectional bottom-up
parsing (Quesada 1998)), that enables our algorithm to grow parse
trees outwards from well-typed terminals. The statement 

declare A: Matrix, B: Matrix, C: Matrix { ... }

gives the variables A, B, and C the type Matrix. We then integrate 
type checking into the parsing process to prune ill-typed parse 
trees before they have a chance to grow, drawing inspiration from
the field of natural language processing, where using types to
resolve ambiguity is known as selection restriction (Jurafsky and
Martin 2009).

Our approach does not altogether prohibit grammar ambigui- 
ties; it strives to remove ambiguities from the common case when 
composing DSLs so as to enable efficient parsing. 



$t_1 \;\cdots\; t_n$

We sometimes use a horizontal notation $A \to t_1 \ldots t_n$ for parse
trees and we often subscript parse trees with their labels, so $t_A$ is a
parse tree $t$ whose root is labeled with $A$. We use an overline to
represent a sequence: $\overline{t} = t_1, \ldots, t_n$.

The yield of a parse tree is the concatenation of the labels on its 
leaves: 

$\mathit{yield}(a) = a$
$\mathit{yield}([A \to t_1 \ldots t_n]) = \mathit{yield}(t_1) \ldots \mathit{yield}(t_n)$
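As a minimal sketch of parse trees and their yield, assume a leaf is represented by the terminal string itself and an internal node by a label plus a tuple of children. The names `Node` and `tree_yield` are our own illustrative choices, not part of the paper's system.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Node:
    label: str                      # the nonterminal labeling the root
    children: Tuple["Tree", ...]    # the subtrees t_1 ... t_n


Tree = Union[str, Node]             # a leaf is just the terminal itself


def tree_yield(t: Tree) -> List[str]:
    """Concatenate the labels on the leaves, from left to right."""
    if isinstance(t, str):
        return [t]
    out: List[str] = []
    for child in t.children:
        out.extend(tree_yield(child))
    return out


# The tree E -> (V -> "-" (V -> "A")) for the input "- A".
t = Node("E", (Node("V", ("-", Node("V", ("A",)))),))
```

Here `tree_yield(t)` walks the tree depth-first, so the leaves come back in input order.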

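Definition 2.1. The set of parse trees for a CFG $\mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)$
and input $w$, written $\mathcal{T}(\mathcal{G}, w)$, is defined as follows.

$\mathcal{T}(\mathcal{G}, w) = \{ t_S \mid t_S \in \mathcal{T}(\mathcal{G}) \text{ and } \mathit{yield}(t_S) = w \}$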

Definition 2.2. The language of a CFG $\mathcal{G}$, written $L(\mathcal{G})$, consists
of all the strings for which there is exactly one parse tree. More
formally,

$L(\mathcal{G}) = \{ w \mid |\mathcal{T}(\mathcal{G}, w)| = 1 \}$

¹ See the supplemental material for the URL for the code.
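To make the "exactly one parse tree" condition concrete, the following sketch counts the parse trees for an input under a small grammar. It is an illustration only: the function names and the toy `UNTYPED` grammar are assumptions of ours, and the counter assumes non-empty right-hand sides and no cycles of unit rules.

```python
from functools import lru_cache


def count_trees(rules, start, tokens):
    """Count parse trees rooted at `start` whose yield is `tokens`.

    Assumes every right-hand side is non-empty and that there are no
    cycles of unit rules (A -> B -> A); otherwise the recursion would
    not terminate.
    """
    w = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(x, i, j):
        if x not in rules:                       # x is a terminal
            return 1 if j == i + 1 and w[i] == x else 0
        return sum(seq(rhs, i, j) for rhs in rules[x])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        if len(rhs) == 1:
            return sym(rhs[0], i, j)
        # The first symbol covers w[i:m]; each remaining symbol needs a token.
        return sum(sym(rhs[0], i, m) * seq(rhs[1:], m, j)
                   for m in range(i + 1, j - len(rhs) + 2))

    return sym(start, 0, len(w))


def in_language(rules, start, tokens):
    """Membership in the sense of Definition 2.2: exactly one parse tree."""
    return count_trees(rules, start, tokens) == 1


# Hypothetical untyped grammar: one binary operator and three identifiers.
UNTYPED = {
    "Expr": [("Expr", "+", "Expr"), ("Id",)],
    "Id": [("A",), ("B",), ("C",)],
}
```

Under this toy grammar, `A + B` has a single tree, while `A + B + C` has two (left-nested and right-nested), so by Definition 2.2 only the former is in the language.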



2.2 Parsing Algorithms 

We wish to compare the essential characteristics of several parsing 
algorithms without getting distracted by implementation details. 
Sikkel (1998) introduces a high-level formalism for presenting
and comparing parsing algorithms, called parsing schemata, that 
present each algorithm as a deductive system. We loosely follow 
his approach, but make some changes to better suit our needs. 

Each parsing algorithm corresponds to a deductive system with
judgments of the form

$H \vdash \xi$

where $\xi$ is an item and $H$ is a set of items. An item has the form
$[p, i, j]$ where $p$ is either a parse tree or a partial parse tree and
the integers $i$ and $j$ mark the left and right extents of what has
been parsed so far. The set of partial parse trees is defined by the
following rule.



If $A \to \alpha\beta\gamma \in \mathcal{P}$, then $A \to \alpha . t_\beta . \gamma$ is a partial parse tree
labeled with $A$.



We reserve the variables $s$ and $t$ for parse trees, not partial parse
trees. A complete parse of an input $w$ of length $n$ is a derivation of
$H_0 \vdash [t_S, 0, n]$, where $H_0$ is the initial set of items that represent
the result of tokenizing the input $w$.

$H_0 = \{ [w_i, i, i+1] \mid 0 \le i < |w| \}$



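Example 2.3. The (top-down) Earley algorithm (Earley 1968, 1970)
applied to a grammar $\mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)$ is defined by the
following deductive rules.

$$\textsc{(Hyp)}\ \frac{\xi \in H}{H \vdash \xi}
\qquad
\textsc{(Fnsh)}\ \frac{H \vdash [A \to .t_\alpha., i, j]}{H \vdash [A \to t_\alpha, i, j]}$$

$$\textsc{(Init)}\ \frac{S \to \gamma \in \mathcal{P}}{H \vdash [S \to ..\gamma, 0, 0]}
\qquad
\textsc{(Pred)}\ \frac{H \vdash [A \to .t_\alpha.B\beta, i, j] \quad B \to \gamma \in \mathcal{P}}{H \vdash [B \to ..\gamma, j, j]}$$

$$\textsc{(Compl)}\ \frac{H \vdash [A \to .s_\alpha.X\beta, i, j] \quad H \vdash [t_X, j, k]}{H \vdash [A \to .s_\alpha t_X.\beta, i, k]}$$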

Example 2.4. A bottom-up variation (Sikkel 1998) of Earley parsing
is obtained by replacing the initialization (Init) and prediction
(Pred) rules with the following bottom-up rule (BU).

$$\textsc{(BU)}\ \frac{H \vdash [t_X, i, j] \quad A \to X\beta \in \mathcal{P}}{H \vdash [A \to .t_X.\beta, i, j]}$$



3. Type-Oriented Island Parsing 

The essential ingredients of our parsing algorithm are type-based 
disambiguation and island parsing. In Section 4 we show that an
algorithm based on these two ideas parses with time complexity 
that is independent of the number of DSLs in use, so long as the 
nonterminals of the DSLs are largely disjoint. (We also make this 
claim more precise.) But first, in this section we introduce our type- 
oriented island parsing algorithm (TIP). 

Island parsing (Stock et al. 1988) is a bidirectional, bottom-up
parsing algorithm that was developed in the context of speech
recognition. In that domain, some tokens can be identified with a 
higher confidence than others. The idea of island parsing is to begin 
the parsing process at the high confidence tokens, the so-called 
islands, and expand the parse trees outward from there. 



Our main insight is that if our parser can be made aware of 
variable declarations, and if a variable's type corresponds to a non- 
terminal in the grammar, then each occurrence of a variable is 
treated as an island. We introduce the following special form for 
declaring a variable a of type A that may be referred to inside the 
curly brackets. 

declare a : A { ... }

For the purposes of parsing, the rule $A \to a$ is added to the grammar
while parsing inside the curly brackets. To enable temporarily
extending the grammar, we augment the judgments of our deductive
system with an explicit parameter for the grammar. So judgments
have the form

$\mathcal{G}; H \vdash \xi$

This adjustment also enables the import of grammars from different
modules. 

We formalize the parsing rule for the declare form as follows. 

$$\textsc{(Decl)}\ \frac{\mathcal{G} \cup (A \to a);\, H \vdash [t_X, i+5, j]}{\mathcal{G};\, H \vdash [X \to \texttt{declare}\ a : A\ \{ t_X \},\, i,\, j+1]}$$

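Next we split the bottom-up rule (BU) into the two following
rules. The (Islnd) rule triggers the formation of an island using
grammar rules of the form $A \to a$, which arise from variable
declarations and from literals (constants) defined in a DSL. The
(IPred) rule generates items from grammar rules that have the
parsed nonterminal $B$ on the right-hand side.

$$\textsc{(Islnd)}\ \frac{\mathcal{G}; H \vdash [a, i, j] \quad A \to a \in \mathcal{P} \quad \mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)}{\mathcal{G}; H \vdash [A \to a, i, j]}$$

$$\textsc{(IPred)}\ \frac{\mathcal{G}; H \vdash [t_B, i, j] \quad A \to \alpha B \beta \in \mathcal{P} \quad \mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)}{\mathcal{G}; H \vdash [A \to \alpha . t_B . \beta, i, j]}$$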
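Finally, because islands appear in the middle of the input string, we
need both left and right-facing versions of the (Compl) rule.

$$\textsc{(RCompl)}\ \frac{\mathcal{G}; H \vdash [A \to \alpha . s_\beta . X\gamma, i, j] \quad \mathcal{G}; H \vdash [t_X, j, k]}{\mathcal{G}; H \vdash [A \to \alpha . s_\beta t_X . \gamma, i, k]}$$

$$\textsc{(LCompl)}\ \frac{\mathcal{G}; H \vdash [t_X, i, j] \quad \mathcal{G}; H \vdash [A \to \alpha X . s_\beta . \gamma, j, k]}{\mathcal{G}; H \vdash [A \to \alpha . t_X s_\beta . \gamma, i, k]}$$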

Definition 3.1. The type-oriented island parsing algorithm is
defined as the deductive system comprising the rules (Hyp),
(Fnsh), (Decl), (Islnd), (IPred), (RCompl), and (LCompl).
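The deductive system can be approximated by a naive fixed-point computation over items. The sketch below is a simplification under stated assumptions, not the implementation described in this paper: items record only spans and rule positions rather than trees, the (Decl) rule is modeled by adding the declared rule to the rule list up front, and the names `island_parse` and `typed_grammar` are hypothetical.

```python
def island_parse(rules, tokens):
    """Return (complete, partial) item sets derivable by the island rules.

    rules  : list of (lhs, rhs) pairs, rhs being a tuple of symbols
    tokens : list of terminal strings
    A complete item (X, i, j) says symbol X spans tokens i..j.
    A partial item (r, lo, hi, i, j) says rhs[lo:hi] of rule r spans i..j.
    """
    nonterms = {lhs for lhs, _ in rules}
    # (Hyp): one complete item per token.
    complete = {(t, i, i + 1) for i, t in enumerate(tokens)}
    partial = set()
    changed = True
    while changed:
        new_c, new_p = set(), set()
        for (x, i, j) in complete:
            for r, (lhs, rhs) in enumerate(rules):
                if x not in nonterms:
                    # (Islnd): a rule A -> a turns the token a into an island.
                    if rhs == (x,):
                        new_c.add((lhs, i, j))
                else:
                    # (IPred): grow a partial tree around the nonterminal x.
                    for pos, s in enumerate(rhs):
                        if s == x:
                            new_p.add((r, pos, pos + 1, i, j))
        for (r, lo, hi, i, j) in partial:
            lhs, rhs = rules[r]
            if lo == 0 and hi == len(rhs):
                new_c.add((lhs, i, j))                   # (Fnsh)
            for (x, a, b) in complete:
                if hi < len(rhs) and rhs[hi] == x and a == j:
                    new_p.add((r, lo, hi + 1, i, b))     # (RCompl)
                if lo > 0 and rhs[lo - 1] == x and b == i:
                    new_p.add((r, lo - 1, hi, a, j))     # (LCompl)
        changed = not (new_c <= complete and new_p <= partial)
        complete |= new_c
        partial |= new_p
    return complete, partial


def typed_grammar(k):
    """E ::= V; V ::= "-" V; plus k unrelated matrix DSLs and `declare A: V`."""
    rules = [("E", ("V",)), ("V", ("-", "V")), ("V", ("A",))]
    for i in range(1, k + 1):
        rules += [("E", (f"M{i}",)), (f"M{i}", ("-", f"M{i}"))]
    return rules
```

In this sketch the token `-` never triggers a rule by itself, so the rules for the unrelated nonterminals `M1`, ..., `Mk` are never touched when parsing `- A` with `A` declared as `V`; the number of items therefore does not depend on `k`.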

The type-oriented island parsing algorithm requires a minor re- 
striction on grammars. If the right-hand side of a rule does not con- 
tain any nonterminals, then it may only contain a single terminal. 
This restriction means that our system supports single-token lit- 
erals but not multi-token literals. For example, the grammar rule 
$A \to$ "foo" "bar" is not allowed, but $A \to$ "foobar" and
$A \to$ "foo" $B$ "bar" "baz" are allowed.
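The restriction is easy to check mechanically. The following sketch assumes a rule's right-hand side is given as a tuple of symbols together with the set of nonterminals; the function name is our own.

```python
def violates_literal_restriction(rhs, nonterminals):
    """True if rhs contains no nonterminal but is not a single terminal."""
    has_nonterminal = any(sym in nonterminals for sym in rhs)
    return not has_nonterminal and len(rhs) != 1


NONTERMINALS = {"A", "B"}   # hypothetical nonterminals for the examples above
```

Applied to the three examples above, only the right-hand side `("foo", "bar")` is rejected.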

4. Experimental Evaluation 

In this section we evaluate the performance of type-oriented island 
parsing with experiments in two separate dimensions. First we 
measure the performance of the algorithm for programs that are
held constant while the size of the grammars increases, and second we
measure the performance for programs that increase in size while
the grammars are held constant.

4.1 Grammar Scaling 

Chart parsing algorithms (Kay 1986) have a general worst-case
running time of $O(|\mathcal{G}|\, n^3)$ for a grammar $\mathcal{G}$ and string of length $n$.
[Figure 4 plots not reproduced. Panels: (a) Untyped Grammar, (b) Semi-typed Grammar, (c) Type-oriented Grammar.]

Figure 4. Comparison of top-down, bottom-up, and island parsing with three styles of grammars.



In our setting, $\mathcal{G}$ is the union of the grammars for all the $k$ DSLs that
are in use within a given scope, that is, $\mathcal{G} = \bigcup_{i=1}^{k} \mathcal{G}_i$, where $\mathcal{G}_i$ is
the grammar for DSL $i$. We claim that the total size of the grammar
$\mathcal{G}$ is not a factor for type-oriented island parsing, and instead the
time complexity is $O(m n^3)$ where $m = \max\{ |\mathcal{G}_i| \mid 1 \le i \le k \}$.
This claim deserves considerable explanation to be made precise.

Technically, we assume that $\mathcal{G}$ is sparse and that the terminals
of $\mathcal{G}$ are well-typed, which we define as follows.

Definition 4.1. Form a Boolean matrix with a row for each nonterminal
and a column for each production rule in a grammar $\mathcal{G}$.
A matrix element $(i, j)$ is true if the nonterminal $i$ appears on the
right-hand side of the rule $j$, and it is false otherwise. We say that $\mathcal{G}$
is sparse if its corresponding matrix is sparse, that is, if the number
of nonzero elements is much smaller than the number of elements.

Definition 4.2. We say that a terminal $a$ of a grammar $\mathcal{G}$ is well-typed
if for each $B$ such that $B \to a \in \mathcal{P}$, $B$ represents a type in
the language of $\mathcal{G}$.
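Definition 4.1 can be turned into a small calculation. The sketch below computes the fraction of true entries in the nonterminal-by-rule matrix; the function names and the family of grammars (one vector DSL plus `k` matrix DSLs, each with a unary minus) are illustrative assumptions.

```python
def nonzero_fraction(rules):
    """Fraction of true entries in the nonterminal-by-rule Boolean matrix.

    rules is a list of (lhs, rhs) pairs; entry (i, j) is true when
    nonterminal i appears on the right-hand side of rule j.
    """
    nonterms = sorted({lhs for lhs, _ in rules})
    true_cells = sum(1 for nt in nonterms for _, rhs in rules if nt in rhs)
    return true_cells / (len(nonterms) * len(rules))


def typed_rules(k):
    """E ::= V; V ::= "-" V; and, for each i, E ::= Mi; Mi ::= "-" Mi."""
    rules = [("E", ("V",)), ("V", ("-", "V"))]
    for i in range(1, k + 1):
        rules += [("E", (f"M{i}",)), (f"M{i}", ("-", f"M{i}"))]
    return rules
```

For this family there are `k + 2` nonterminals and `2k + 2` rules, and each of `V`, `M1`, ..., `Mk` appears in exactly two right-hand sides, so the fraction is `1 / (k + 2)`, which shrinks as more DSLs are imported.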

We expect that the terminals of type-oriented grammars will be 
well-typed, and hypothesize that, in the common case, the union of 
many type-oriented grammars (or DSLs) is sparse. 

To verify that both the type-oriented style of grammar and the 
island parsing algorithm are necessary for this result, we show 
that removing either of these ingredients results in parse times that 
are dependent on the size of the entire grammar. Specifically, we 
consider the performance of the top-down and bottom-up Earley 
algorithms, in addition to island parsing, with respect to untyped, 
semi-typed, and type-oriented grammars. 

We implemented all three algorithms in a chart parsing framework
(Kay 1986), which efficiently memoizes duplicate items. The
chart parser continues until it has generated all items that can be 
derived from the input string. (It does not stop at the first complete 
parse because it needs to continue to check whether the input string 
is ambiguous, which means the input would be in error.) Also, we
should note that our system currently employs a fixed tokenizer, but 
that we plan to look into scannerless parsing. 

To capture the essential, asymptotic behavior of the parsing 
algorithms, we measure the number of items generated during the 
parsing of the program. 

4.1.1 A Small Experiment 

For the first experiment we parse the expression -A with untyped,
semi-typed, and typed grammars. 

Untyped In the untyped scenario, all grammar rules are defined in 
terms of the expression nonterminal (E), and variables are simply 
parsed as identifiers (Id). 



module Untyped^k {
  E ::= Id | "-" E;

} 

The results for parsing -A after importing k copies of Untyped,
for increasing k, are shown in Figure 4(a). The y-axis is the number
of items generated by each parsing algorithm, and the x-axis is the
total number of grammar rules at each k. In the untyped scenario,
the size of the grammar affects the performance of each algorithm,
with each generating $O(k^2)$ items.

We note that the two Earley algorithms generate about half as 
many items as the island parser because they are unidirectional 
(left-to-right) instead of bidirectional. 

Semi-typed In the semi-typed scenario, the grammars are nearly
type-oriented: the Semityped^0 module defines the nonterminal V
(for vector) and each of Semityped^i for $i \in \{1, \ldots, k\}$ defines the
nonterminal $M_i$ (for matrix); however, variables are again parsed as
identifiers. We call this scenario semi-typed because it does not use
variable declarations to provide type-based disambiguation.

module Semityped^0 {
  E ::= V;
  V ::= Id | "-" V;
}

module Semityped^i {
  E ::= M_i;
  M_i ::= Id | "-" M_i;
}

The results for parsing -A after importing Semityped^0 followed
by Semityped^i for $i \in \{1, \ldots, k\}$ are shown in Figure 4(b).
The lines for bottom-up Earley and island parsing coincide. Each
algorithm generates $O(k)$ items, but we see that type-oriented
grammars are not, by themselves, enough to achieve constant scaling
with respect to grammar size.

We note that the top-down Earley algorithm generates almost
twice as many items as the bottom-up algorithms: the alternatives
for the start symbol E grow with k, which affects the top-down
strategy more than bottom-up.

Typed The typed scenario is identical to semi-typed except that it 
no longer includes the Id nonterminal. Instead, programs using the 
Typed module must declare their own typed variables. 

module Typed^0 {
  E ::= V;
  V ::= "-" V;
}
[Figure 5 plot not reproduced. Curves: island, bottom-up Earley, top-down Earley. x-axis: number of grammar rules.]

Figure 5. Comparison of parsing algorithms for a type-oriented
matrix algebra DSL and increasing grammar size.



module Typed^i {
  E ::= M_i;
  M_i ::= "-" M_i;
}

The results for parsing -A after importing Typed^0 followed by
Typed^i for $i \in \{1, \ldots, k\}$ and declaring A: V are shown in
Figure 4(c). The sparsity for this example is $O(1/k)$, and now the
terminal A (as V) is well-typed. The island parsing algorithm generates
a constant number of items as the size of the grammar increases, 
while the Earley algorithms remain linear. Thus, the combination 
of type-based disambiguation, type-oriented grammars, and island 
parsing provides a scalable approach to parsing programs that use 
many DSLs. 

4.1.2 A Larger Experiment 

For a second experiment we measure the performance of each al- 
gorithm for a sequence of matrix algebra operations with expanded 
versions of the grammars in Figure 3.

import MatrixAlgebra, RegularExpressions^k, Sets^k;

... u1 * v1' + u2 * v2'
... (B' * y) + z;
... (B * x);



In this example, we import grammars for RegularExpressions 
and Sets, k times each. For the untyped and semi-typed scenarios, 
the result is too ambiguous and we terminated their execution after 
waiting for several minutes. For the typed scenario, we declare the 
variables A and B as type Matrix; u1, u2, v1, v2, and w-z as type
ColVector; a and b as type Scalar; the sparsity of the typed
example is $O(1/k)$.

Figure 5 shows a graph for parsing the above program with
each algorithm. As before, the y-axis is the number of items gen- 
erated during parsing, and the x-axis is the number of DSLs that 
are imported. The top-down Earley algorithm scales linearly with 
respect to the number of DSLs imported and generates many more 
items than the bottom-up algorithms. The island parsing algorithm 
generates a constant number of items as the number of DSLs increases;
the bottom-up Earley algorithm generates a similar number
of items, but its count still grows linearly, with a shallow slope.

4.1.3 Discussion 

The reason type-oriented island parsing scales is that it is more
conservative with respect to prediction than either top-down or
bottom-up, and so grammar rules from other DSLs that are irrelevant to the
program fragment being parsed are never used to generate items. 

Consider the (Pred) rule of top-down Earley parsing. Any 
rule that produces the non-terminal B, regardless of which DSL it 
resides in, will be entered into the chart. Note that such items have 
a zero-length extent which indicates that the algorithm does not yet 
have a reason to believe that this item will be able to complete. 

Looking at the (BU) rule of bottom-up Earley parsing, we see 
that all it takes for a rule (from any DSL) to be used is that it 
starts with a terminal that occurs in the program. However, it is 
quite likely that different DSLs will have rules with some terminals 
in common. Thus, the bottom-up algorithm also introduces items 
from irrelevant DSLs. 

Next, consider the (ISLND) rule of our island parser. There 
is no prediction in this rule. However, it is possible for different 
DSLs to define literals with the same syntax (same tokens). (Many 
languages forbid the overloading of constants, but it is allowed, for 
example, in Haskell.) The performance of the island parser would 
degrade in such a scenario, although the programmer could regain 
performance by redefining the syntax of the imported constants, in 
the same way that name conflicts can be avoided by the rename-on- 
import constructs provided by module systems. 

Finally, consider the (IPred) rule of our island parser. The 
difference between this rule and (BU) is that it only applies to 
nonterminals, not terminals. As we previously stated, we assume 
that the nonterminals in the different DSLs are, for the most part, 
disjoint. Thus, the (IPred) rule typically generates items based on 
rules in the relevant DSL's grammar and not from other DSLs. 

4.2 Program Scaling 

In this section we measure the performance of each algorithm as 
the size of the program increases and the grammar is held constant. 
The program of size n is the addition of n matrices using the matrix 
algebra grammar from the previous section. 

As before, we consider untyped, semi-typed, and typed scenar- 
ios. For these experiments we report parse times; we ran all of the 
experiments on a MacBook with a 2.16 GHz Intel Core 2 Duo
processor and 2 GB of RAM.

Untyped The untyped scenario is exponentially ambiguous: 

import MatrixAlgebra, RegularExpressions, Sets; 
A + A + ... + A

While the above program with $n$ terms produces $O(2^n)$ parse trees,
the Earley and island parsing algorithms can produce a packed
parse forest in polynomial space and time (Allen 1995).

Figure 6(a) shows the results for each algorithm on a logarithmic
scale. The y-axis is the parse time (including production of
parse trees), and the x-axis is the program size. Because our im- 
plementation does not use packed forests, all three algorithms are 
exponential for the untyped scenario. 

Semi-typed The program for the semi-typed scenario is identical 
to the untyped scenario and is also ambiguous; however, the number
of parse trees doesn't grow with increasing program size.
Figure 6(b) shows the results for each algorithm, now on a linear scale.
The axes are the same as before. Here the top-down Earley and
island algorithms are $O(n^2)$. Although the number of correct parse
trees remains constant, the bottom-up Earley algorithm explores an
exponential number of possible trees as $n$ increases before returning,
and uses exponential time.

Typed The program is no longer ambiguous in the typed scenario:

import MatrixAlgebra, RegularExpressions, Sets;
declare A: Matrix {
  A + A + ... + A
}

Figure 6. Comparison of parsing algorithms for increasing program size. Figure (a) uses a logarithmic scale, with the program size ranging
from 1 to 10. Figures (b) and (c) use a linear scale, with the program size ranging from 1 to 50 in (b) and ranging from 1 to 100 in (c).

Figure 6(c) shows the results for each algorithm on a linear scale
and with axes as before. All three algorithms are $O(n^2)$ for the
typed scenario. These results suggest that type-oriented Earley and
island parsing are $O(n^2)$ for unambiguous grammars.

We should note that the top-down Earley algorithm parses the
above program in $O(n)$ time when the grammars are rewritten to
be LR(0); however, the bottom-up Earley and island algorithms
remain $O(n^2)$.

5. A System for Extensible Syntax 

In this section we describe the parsing system that we have built 
as a front end to the Racket programming language. In particular, 
we describe how we implement four features that are needed in a 
practical extensible parsing system: associativity and precedence, 
parameterized grammar rules, grammar rules with variable binders 
and scope, and rule-action pairs (Sandberg 1982), which combine
the notions of semantic actions, function definitions, and macros.

5.1 Associativity and Precedence 

We view associativity and precedence annotations (as in Figure 2,
e.g., [left,1]) as a must for our parsing system because we do
not expect all of our users to be computer scientists, that is, we do 
not expect them to know how to manually factor a grammar to take 
associativity and precedence into account. Further, even for users 
who are computer scientists, they probably have something better 
to do with their time than to factor grammars. 

Our treatment of associativity and precedence is largely based
on that of Visser (1997), although we treat this as a semantic issue
instead of an optimization issue. From the user perspective, we
extend rules to have the form $A \to_{\ell,p} \alpha$ where $\ell$ indicates the
associativity, where $\ell \in \{\mathrm{left}, \mathrm{right}, \mathrm{non}, \bot\}$, and $p$ indicates the
precedence, where $p \in \mathbb{N}_\bot$. We use an ordering $<$ on precedences
that is the natural lifting of $<$ on $\mathbb{N}$. (Instead of $\mathbb{N}$ we could
use any partially ordered set, but prefer to be concrete here.)

To specify the semantics of precedence and associativity, we 
use the notion of a filter to remove parse trees from consideration 
if they contain precedence or associativity conflicts (Visser 1997).
But first, we annotate our parse trees with the precedence and
associativity, so an internal node has the form $A \to_{\ell,p} \overline{t}$.

Definition 5.1. We say that a parse tree $t$ has a root priority conflict,
written $\mathit{conflict}(t)$, if one of the following holds.

1. It violates the right, left or non-associativity rules, that is, $t$ has
the form:

• $A \to_{\ell,p} (A \to_{\ell,p} t_{A\alpha})\, s_\alpha$ where $\ell = \mathrm{right}$ or $\ell = \mathrm{non}$.

• $A \to_{\ell,p} s_\alpha\, (A \to_{\ell,p} t_{\alpha A})$ where $\ell = \mathrm{left}$ or $\ell = \mathrm{non}$.

2. It violates the precedence rule, that is, $t$ has the form:

$t = A \to_{\ell,p} \overline{s}\, (B \to_{\ell',p'} \overline{t})\, \overline{s'}$ where $p' < p$.

Definition 5.2. A tree context $C$ is defined by the following grammar.

$C ::= \square \mid A \to_{\ell,p} t_1 \ldots C \ldots t_n$

The operation of plugging a tree $t$ into a tree context $C$, written $C[t]$,
is defined as follows.

$\square[t] = t$

$(A \to_{\ell,p} t_1 \ldots C \ldots t_n)[t] = A \to_{\ell,p} t_1 \ldots C[t] \ldots t_n$

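Definition 5.3. The filter for a CFG $\mathcal{G}$ is a function on sets of trees,
$\mathcal{F} : \wp(\mathcal{T}(\mathcal{G})) \to \wp(\mathcal{T}(\mathcal{G}))$, that removes the trees containing conflicts.
That is,

$\mathcal{F}(\Phi) = \{ t \in \Phi \mid \neg\exists t', C.\ t = C[t'] \text{ and } \mathit{conflict}(t') \}$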
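The conflict check and the filter can be sketched as follows. This is an illustration under several assumptions of ours: nodes carry the rule that produced them, the associativity cases are read as applying when parent and child use the same rule, the absent precedence is modeled as `None` and never causes a conflict, and all names (`Rule`, `Node`, `conflict`, `has_conflict`, `tree_filter`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: Tuple[str, ...]
    assoc: Optional[str] = None    # "left", "right", "non", or None for bottom
    prec: Optional[int] = None     # None stands for bottom


@dataclass(frozen=True)
class Node:
    rule: Rule
    children: Tuple["Tree", ...]


Tree = Union[str, Node]


def conflict(t: Tree) -> bool:
    """Root priority conflict, under the assumptions stated above."""
    if isinstance(t, str):
        return False
    r, kids = t.rule, t.children
    first, last = kids[0], kids[-1]
    # Same rule nested on the left of a right- or non-associative rule.
    if (isinstance(first, Node) and first.rule == r and r.rhs[0] == r.lhs
            and r.assoc in ("right", "non")):
        return True
    # Same rule nested on the right of a left- or non-associative rule.
    if (isinstance(last, Node) and last.rule == r and r.rhs[-1] == r.lhs
            and r.assoc in ("left", "non")):
        return True
    # A child whose precedence is strictly lower than the parent's.
    return any(isinstance(c, Node) and r.prec is not None
               and c.rule.prec is not None and c.rule.prec < r.prec
               for c in kids)


def has_conflict(t: Tree) -> bool:
    """True if some subtree of t (t itself included) has a root conflict."""
    if isinstance(t, str):
        return False
    return conflict(t) or any(has_conflict(c) for c in t.children)


def tree_filter(trees):
    """Keep only the trees that contain no conflict anywhere."""
    return [t for t in trees if not has_conflict(t)]


add = Rule("Matrix", ("Matrix", "+", "Matrix"), "left", 1)
mul = Rule("Matrix", ("Matrix", "*", "Matrix"), "left", 2)
var = Rule("Matrix", ("A",))
a = Node(var, ("A",))
left_nested = Node(add, (Node(add, (a, "+", a)), "+", a))     # (A + A) + A
right_nested = Node(add, (a, "+", Node(add, (a, "+", a))))    # A + (A + A)
low_under_high = Node(mul, (Node(add, (a, "+", a)), "*", a))  # (A + A) * A
```

Searching every subtree for a root conflict is equivalent to quantifying over all contexts `C` with `t = C[t']`, which is why `has_conflict` is a plain recursive walk.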

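Definition 5.4. The set of parse trees for a grammar $\mathcal{G}$ (with
precedence and associativity) and input $w$, written $\mathcal{T}(\mathcal{G}, w)$, is
defined as follows.

$\mathcal{T}(\mathcal{G}, w) = \{ t_S \mid t_S \in \mathcal{F}(\mathcal{T}(\mathcal{G})) \text{ and } \mathit{yield}(t_S) = w \}$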

The change to the island parsing algorithm to handle precedence 
and associativity is straightforward. We simply make sure that 
a partial parse tree does not have a root priority conflict before 
converting it into a (complete) parse tree. We replace the (Fnsh) 
rule with the following rule. 

$$\textsc{(FnshP)}\ \frac{\mathcal{G}; H \vdash [A \to .t_\alpha., i, j] \quad \neg\mathit{conflict}(A \to t_\alpha)}{\mathcal{G}; H \vdash [A \to t_\alpha, i, j]}$$

5.2 Parameterized Rules 

With the move to type-oriented grammars, the need for parame- 
terized rules immediately arises. For example, consider how one 
might translate the following grammar rule for conditional expres- 
sions into a type-oriented rule. 

E ::= "if" E "then" E "else" E

We would like to be more specific than E for the two branches and 
for the left-hand side. So we extend our grammar rules to enable 
the parameterization of nonterminals. We can express a conditional 
expression as follows, where T stands for any type/nonterminal. 

forall T. 

T ::= "if" Bool "then" T "else" T 
To simplify the presentation, we describe parameterized rules
as an extension to the base island parser (without precedence).
However, our parsing system combines both extensions. Here we
extend grammar rules to include parameters: $\forall \overline{x}.\ A \to \alpha$. ($\overline{x}$ may
not contain duplicates.) We use $x$, $y$, $z$ to range over variables and
we now use the variables $A$, $B$, $C$, $D$ to range over nonterminals
and variables.

To handle parameters we need the notion of a substitution $\sigma$,
that is, a partial function from variables to nonterminals. The initial
substitution $\sigma_0$ is everywhere undefined. We extend the action of
a substitution to all symbols, sequences, and rules in the following
natural way.

$\sigma(a) = a$
$\sigma(X_1 \ldots X_n) = \sigma(X_1) \ldots \sigma(X_n)$
$\sigma(A \to \alpha) = \sigma(A) \to \sigma(\alpha)$


The notation (t[X 
defined as follows. 



Y] creates a new, extended substitution. 



a[X ^ Y]{Z) 



Y ifx = y, 

cr{Z) otherwise. 



We write (7[X h-)- y] to abbreviate 

a[Xi^Yi]---[Xix\^YiY\]. 

We write [X i-^ Y] to abbreviate ctq [X F] . 
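The substitution operations can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: a substitution is modelled as a dictionary, symbols are plain strings, and the helper names (`extend`, `extend_all`, `apply_symbol`, `apply_seq`, `apply_rule`) are hypothetical. As a simplifying assumption, any symbol that the substitution does not bind is mapped to itself.

```python
from typing import Dict, Sequence, Tuple

Subst = Dict[str, str]  # variable -> nonterminal; a missing key means "undefined"

SIGMA0: Subst = {}  # the initial substitution, everywhere undefined


def extend(sigma: Subst, x: str, y: str) -> Subst:
    """sigma[x |-> y]: build a new substitution, leaving sigma untouched."""
    new = dict(sigma)
    new[x] = y
    return new


def extend_all(sigma: Subst, xs: Sequence[str], ys: Sequence[str]) -> Subst:
    """sigma[x1 |-> y1]...[xn |-> yn] for two sequences of equal length."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    for x, y in zip(xs, ys):
        sigma = extend(sigma, x, y)
    return sigma


def apply_symbol(sigma: Subst, sym: str) -> str:
    """Terminals and unbound symbols map to themselves (assumption)."""
    return sigma.get(sym, sym)


def apply_seq(sigma: Subst, seq: Sequence[str]) -> Tuple[str, ...]:
    return tuple(apply_symbol(sigma, s) for s in seq)


def apply_rule(sigma: Subst, rule: Tuple[str, Sequence[str]]) -> Tuple[str, Tuple[str, ...]]:
    lhs, rhs = rule
    return apply_symbol(sigma, lhs), apply_seq(sigma, rhs)


# The conditional rule from above, instantiated with T = Int.
cond = ("T", ('"if"', "Bool", '"then"', "T", '"else"', "T"))
sigma = extend_all(SIGMA0, ["T"], ["Int"])
instantiated = apply_rule(sigma, cond)
```

Here `instantiated` is the rule with every occurrence of `T` replaced by `Int`, while the quoted terminals and `Bool` are left alone.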

Next we update the definition of a parse tree to include parame- 
terized rules. The formation rule for leaves remains unchanged, but 
the rule for internal nodes becomes as follows. 

If $A \in \Delta$, $\forall \bar{x}.\ A \to \alpha \in \mathcal{P}$, and $\sigma = [\bar{x} \mapsto \bar{B}]$, then
$\sigma(A) \to t_{\sigma(\alpha)}$ is a parse tree labeled with $\sigma(A)$.

The definition of the language of a CFG with parameterized 
rules requires some care because parameterized rules introduce am- 
biguity. For example, consider the parameterized rule for condi- 
tional expressions given above and the following program. 

if true then else 1 

Instantiating parameter T with either Int or E leads to a complete 
parse. Of course, instantiating with Int is better in that it is more 
specific. We formalize this notion as follows. 

Definition 5.5. We inductively define whether $A$ is at least as
specific as $B$, written $A \geq B$, as follows.

1. If $B \to A \in \mathcal{P}$, then $A \geq B$.

2. (reflexive) $A \geq A$.

3. (transitive) If $A \geq B$ and $B \geq C$, then $A \geq C$.

We extend this ordering to terminals by defining $a \geq b$ iff $a = b$,
and to sequences by defining

\[ \alpha \geq \beta \text{ iff } |\alpha| = |\beta| \text{ and } \alpha_i \geq \beta_i \text{ for } i \in \{1, \ldots, |\alpha|\}. \]

A parse tree node $A \to s_\alpha$ is at least as specific as another parse
tree node $B \to t_\beta$ if and only if $A \geq B$ and $s_\alpha \geq t_\beta$.

We define the least upper bound, $A \vee B$, with respect to the $\geq$
relation in the usual way. Note that a least upper bound does not
always exist.
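A small Python sketch may make the ordering concrete. The chain rules below are hypothetical. The sketch also assumes that $A \vee B$ denotes the most specific nonterminal that both $A$ and $B$ are at least as specific as (a common "supertype"); that reading is an interpretation, since the text only says the bound is defined in the usual way.

```python
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical chain rules B ::= A, stored as (B, A).
CHAIN_RULES: List[Tuple[str, str]] = [("Float", "Int"), ("E", "Float"), ("E", "Bool")]


def less_specific(a: str, rules: Sequence[Tuple[str, str]]) -> Set[str]:
    """All B with a >= B: the reflexive, transitive closure over chain rules."""
    seen = {a}
    stack = [a]
    while stack:
        cur = stack.pop()
        for lhs, rhs in rules:
            if rhs == cur and lhs not in seen:
                seen.add(lhs)
                stack.append(lhs)
    return seen


def at_least_as_specific(a: str, b: str, rules: Sequence[Tuple[str, str]]) -> bool:
    return b in less_specific(a, rules)


def seq_at_least_as_specific(alpha: Sequence[str], beta: Sequence[str],
                             rules: Sequence[Tuple[str, str]]) -> bool:
    """Pointwise extension to sequences of equal length."""
    return len(alpha) == len(beta) and all(
        at_least_as_specific(x, y, rules) for x, y in zip(alpha, beta))


def join(a: str, b: str, rules: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Assumed reading of A v B; returns None when no such bound exists."""
    common = less_specific(a, rules) & less_specific(b, rules)
    least = [c for c in common
             if all(at_least_as_specific(c, d, rules) for d in common)]
    return least[0] if len(least) == 1 else None
```

With these rules, `join("Int", "Float", CHAIN_RULES)` is `"Float"`, `join("Int", "Bool", CHAIN_RULES)` is `"E"`, and `join("Int", "Set", CHAIN_RULES)` is `None`, which illustrates that a least upper bound need not exist.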

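Definition 5.6. The language of a CFG $\mathcal{G}$ with parameterized rules,
written $L(\mathcal{G})$, consists of all the strings for which there is a most
specific parse tree. More formally,

\[ L(\mathcal{G}) = \{\, w \mid \exists t \in \mathcal{T}(\mathcal{G}, w).\ \forall t' \in \mathcal{T}(\mathcal{G}, w).\ t' \neq t \implies t \geq t' \,\} \]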

Next we turn to augmenting our island parsing algorithm to deal
with parameterized rules. We wish to implicitly instantiate
parameterized grammar rules, that is, automatically determine which
nonterminals to substitute in for the parameters. Towards this end, we
define a partial function named match that compares two symbols
with respect to a substitution $\sigma$ and list of variables $\bar{y}$ and produces
a new substitution $\sigma'$ (if the match is successful).



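\[
\begin{aligned}
\mathit{match}(X, X, \sigma, \bar{y}) &= \sigma \\
\mathit{match}(x, Y, \sigma, \bar{y}) &=
\begin{cases}
\sigma[x \mapsto X \vee Y] & \text{if } x \in \bar{y} \text{ and } \sigma(x) = X \\
\sigma[x \mapsto Y] & \text{if } x \in \bar{y} \text{ and } x \notin \mathrm{dom}(\sigma)
\end{cases}
\end{aligned}
\]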
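The match function can be sketched in Python as below. This is an illustration under stated assumptions: substitutions are dictionaries, `None` stands for "undefined", and the least upper bound is supplied by the caller as a `join` function. The `toy_join` used in the example is hypothetical and only knows a single chain in which `Int` is more specific than `Float`.

```python
from typing import Callable, Dict, Optional, Sequence

Subst = Dict[str, str]
Join = Callable[[str, str], Optional[str]]


def match(x: str, y: str, sigma: Subst, params: Sequence[str],
          join: Join) -> Optional[Subst]:
    """Return the (possibly extended) substitution, or None if undefined."""
    if x == y:
        return sigma  # identical symbols: nothing to record
    if x in params:
        if x in sigma:
            bound = join(sigma[x], y)  # merge with the earlier instantiation
            return None if bound is None else {**sigma, x: bound}
        return {**sigma, x: y}  # first time the parameter is seen
    return None  # different symbols and x is not a parameter


def toy_join(a: str, b: str) -> Optional[str]:
    """Hypothetical join for the chain Int >= Float."""
    chain = ["Int", "Float"]  # ordered from most to least specific
    if a in chain and b in chain:
        return chain[max(chain.index(a), chain.index(b))]
    return a if a == b else None


params = ["T"]
s1 = match("T", "Int", {}, params, toy_join)       # binds T to Int
s2 = match("T", "Float", s1, params, toy_join)     # widens T to Float
s3 = match("Bool", "Bool", s2, params, toy_join)   # unchanged
s4 = match("Bool", "Int", {}, params, toy_join)    # undefined
```

Each call returns a fresh dictionary rather than mutating its argument, so partial parse trees that share a prefix can keep their own substitutions.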



Next, we augment a partial parse tree with a substitution to
incrementally accumulate the matches. So a partial tree has the
form $\forall \bar{x}.\ A \to^{\sigma} \alpha.t_\beta.\gamma$. We then update four of the deduction rules
as shown below, leaving (Hyp), (Decl), and (ISLND) unchanged.



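\[
(\textsc{PFnsh})\quad
\frac{\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma} .t_\alpha.,\ i, j]}
     {\mathcal{G}; H \vdash [\sigma(A) \to t_\alpha,\ i, j]}
\]

\[
(\textsc{PIPred})\quad
\frac{\begin{array}{c}
\mathcal{G}; H \vdash [t_B,\ i, j] \qquad \mathit{match}(B', B, \sigma_0, \bar{x}) = \sigma \\
\forall \bar{x}.\ A \to \alpha B' \beta \in \mathcal{P} \qquad \mathcal{G} = (\Sigma, \Delta, \mathcal{P}, S)
\end{array}}
     {\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma} \alpha.t_B.\beta,\ i, j]}
\]

\[
(\textsc{PRCompl})\quad
\frac{\begin{array}{c}
\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma_1} \alpha.s_\beta.X'\gamma,\ i, j] \\
\mathcal{G}; H \vdash [t_X,\ j, k] \qquad \mathit{match}(X', X, \sigma_1, \bar{x}) = \sigma_2
\end{array}}
     {\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma_2} \alpha.s_\beta t_X.\gamma,\ i, k]}
\]

\[
(\textsc{PLCompl})\quad
\frac{\begin{array}{c}
\mathcal{G}; H \vdash [t_X,\ i, j] \\
\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma_1} \alpha X'.s_\beta.\gamma,\ j, k] \qquad \mathit{match}(X', X, \sigma_1, \bar{x}) = \sigma_2
\end{array}}
     {\mathcal{G}; H \vdash [\forall \bar{x}.\ A \to^{\sigma_2} \alpha.t_X s_\beta.\gamma,\ i, k]}
\]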

The above rules ensure that we instantiate type parameters in a 
way that generates the most specific parses for parameterized rules, 
but there is still the possibility of ambiguities in non-parametric 
rules. For example, consider the following grammar. 

Float ::= Int
Float ::= Float "+" Float
Int ::= Int "+" Int

The program 

1 + 2 

can be parsed at least three different ways, with no coercions from 
Int to Float, with two coercions, or with just one coercion. To 
make sure that our algorithm picks the most specific parse, with no 
coercions, we make sure to explore derivations in the order of most 
specific first. 
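The text does not spell out how the exploration order is realised. As one possible illustration only, the Python sketch below ranks the three candidate parses of `1 + 2` by the number of `Float ::= Int` coercions they contain and yields them from a priority queue, fewest coercions first. The tree encoding, the coercion count used as a proxy for specificity, and the function names are all assumptions made for this example.

```python
import heapq
from typing import Iterator, List, Tuple, Union

Tree = Tuple[str, List[Union["Tree", str]]]  # (label, children); str = token

COERCIONS = {("Float", "Int")}  # the chain rule Float ::= Int


def count_coercions(tree: Tree) -> int:
    """Count the nodes that apply a chain rule listed in COERCIONS."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    total = sum(count_coercions(c) for c in subtrees)
    if len(children) == 1 and subtrees and (label, subtrees[0][0]) in COERCIONS:
        total += 1
    return total


def most_specific_first(trees: List[Tree]) -> Iterator[Tuple[int, Tree]]:
    """Yield (cost, tree) pairs in order of increasing coercion count."""
    # The index breaks ties so that trees themselves are never compared.
    heap = [(count_coercions(t), i, t) for i, t in enumerate(trees)]
    heapq.heapify(heap)
    while heap:
        cost, _, tree = heapq.heappop(heap)
        yield cost, tree


one: Tree = ("Int", ["1"])
two: Tree = ("Int", ["2"])
no_coercion: Tree = ("Int", [one, "+", two])
one_coercion: Tree = ("Float", [no_coercion])
two_coercions: Tree = ("Float", [("Float", [one]), "+", ("Float", [two])])

ordered = list(most_specific_first([two_coercions, one_coercion, no_coercion]))
```

In `ordered`, the parse labeled `Int` with no coercions comes first, matching the parse that the text says should be selected.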

5.3 Grammar Rules with Variable Binders

Variable binding and scoping is an important aspect of program- 
ming languages and domain-specific languages are no different in 
this regard. Consider what would be needed to define the grammar 
rule to parse a let expression such as the following, in which n is 
in scope between the curly brackets. 

let n = 7{n*n} 

To facilitate the definition of binding forms, we add two extensions
to our extensible parsing system: labeled symbols [Jim et al. 2010]
and a scoping construct [Cardelli et al. 1994]. First, to see an
example, consider the grammar rule below.

forall T1 T2.
T2 ::= "let" x:Id "=" T1 "{" x:T1; T2 "}"

The identifier Id is now labeled with x, which provides a way
to refer to the string that was parsed as Id. The curly brackets
are our scoping construct, that is, they are treated specially. The
x:T1 inside the curly brackets declares that x is in scope during
the parsing of T2. Effectively, the grammar is extended with the
rule $\texttt{T1} \to \texttt{x}$ (but with T1 replaced by the nonterminal that it is
instantiated to, and with x replaced by its associated string).

The addition of variable binders and scoping complicates the 
parsing algorithm because we can no longer proceed purely in a 
bottom-up fashion. In this example, we cannot parse inside the 
curly brackets until we have parsed the header of the let expression,
that is, the variable name and the right-hand side T1. Our
parsing system handles this by parsing in phases, where initially, 
all regions of the input enclosed in curly braces are ignored. Once 
enough of the text surrounding a curly-brace enclosed region has 
been parsed, then that region is "opened" and the next phase of 
parsing begins. 
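The first step of this phased approach, setting aside the brace-enclosed regions, can be sketched in Python as follows. The sketch only shows how top-level regions might be replaced by placeholders and kept for a later phase. It does not model how the system decides that enough surrounding text has been parsed, and the function name and the `("REGION", n)` placeholder are hypothetical.

```python
from typing import List, Tuple, Union

Token = Union[str, Tuple[str, int]]


def split_phase(tokens: List[str]) -> Tuple[List[Token], List[List[str]]]:
    """Replace each top-level {...} region by a placeholder.

    Returns the outer token list and the deferred regions; nested braces
    stay inside their region and are split when that region is opened.
    """
    outer: List[Token] = []
    regions: List[List[str]] = []
    current: List[str] = []
    depth = 0
    for tok in tokens:
        if tok == "{":
            if depth == 0:
                current = []
                outer.append(("REGION", len(regions)))
            else:
                current.append(tok)
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced '}'")
            if depth == 0:
                regions.append(current)
            else:
                current.append(tok)
        elif depth == 0:
            outer.append(tok)
        else:
            current.append(tok)
    if depth != 0:
        raise ValueError("unbalanced '{'")
    return outer, regions


tokens = "let n = 7 { let m = n { m * n } }".split()
outer1, regions1 = split_phase(tokens)          # phase 1: header only
outer2, regions2 = split_phase(regions1[0])     # phase 2: region opened
```

After the first call only `let n = 7` and a placeholder are visible; the body becomes visible when the region is opened, at which point `n` can be in scope.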

5.4 Rule-Action Pairs and Nonterminal-Type Mappings 

Sandberg [1982] introduces the notion of a rule-action pair, which
pairs a grammar rule with a semantic action that provides code to
give semantics to the syntax. The following is one of his examples
but written using our parsing system on top of Typed Racket.

Integer ::= "|" i:Integer "|" => (abs i);

The above example defines syntax for the absolute value operation
on integers and it defines how to compute the absolute value with
code in Typed Racket. After a declaration such as the one above,
the programmer can use the notation |x| in the rest of the current
scope, including subsequent actions within rule-action pairs.

In Sandberg's paper, it seems that rule-action pairs behave like 
macros. In our system, we provide rule-action pairs that behave like 
functions as well (with call-by-value semantics). The => operator 
introduces a function (as in the above example) and the = operator 
introduces a macro. For example, one would want to use a macro 
to define the syntax of an if expression (Figure 7) to avoid always
evaluating both branches. We refer to a rule-action pair that defines 
a function as a rule-function and a rule-action pair that defines a 
macro as a rule-macro. 

In addition to rule-action pairs, we need a mechanism for con- 
necting nonterminals to types in the host programming language. 
We accomplish this by simply adding syntax to map a nonterminal 
to a type. For example, to abbreviate the Typed Racket Integer 
type as Int, one would write the following in a grammar. 

Int = Integer; 

The implementation of our parsing system translates an input 
program, containing a mixture of Typed Racket plus our grammar 
extensions, into a program that is purely Typed Racket. In the 
following we describe the translation. 

A nonterminal-type mapping is translated into a type alias defi- 
nition. So A = T; translates to 

(define-type A T) 

where T is an expression that evaluates to a type in Typed Racket. 

We use two auxiliary functions to compute the arguments of
rule-functions and rule-macros for translation. The support of a
sequence $\alpha$ is the sequence of variables bound in $\alpha$; the binders
of $\alpha$ is the sequence of variable bindings in $\alpha$. In the following
definitions we use list comprehension notation.

\[
\begin{aligned}
\mathit{supp}(\alpha) &= [\, x_i \mid \alpha_i \in \alpha,\ \alpha_i = x_i : B_i \,] \\
\mathit{binders}(\alpha) &= [\, x_i : B_i \mid \alpha_i \in \alpha,\ \alpha_i = x_i : B_i \,]
\end{aligned}
\]
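The two definitions carry over almost directly to Python list comprehensions. In this sketch the encoding of a right-hand side is an assumption: a labeled symbol `x:B` is a pair `(x, B)`, and an unlabeled symbol is a plain string.

```python
from typing import List, Sequence, Tuple, Union

Symbol = Union[str, Tuple[str, str]]  # "tok" or (label, nonterminal)


def supp(alpha: Sequence[Symbol]) -> List[str]:
    """The variables bound in alpha, in order."""
    return [s[0] for s in alpha if isinstance(s, tuple)]


def binders(alpha: Sequence[Symbol]) -> List[Tuple[str, str]]:
    """The variable bindings x:B in alpha, in order."""
    return [(s[0], s[1]) for s in alpha if isinstance(s, tuple)]


# Right-hand sides of the absolute-value rule and of an addition rule.
abs_rhs: List[Symbol] = ['"|"', ("i", "Integer"), '"|"']
plus_rhs: List[Symbol] = [("x", "Int"), '"+"', ("y", "Int")]
```

For example, `supp(plus_rhs)` gives `["x", "y"]` and `binders(abs_rhs)` gives `[("i", "Integer")]`; the quoted terminals contribute nothing to either list.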

For both rule-functions and rule-macros, our system generates
a unique name $f$ and $m$, respectively, for use in the Typed Racket
output. Then a rule-function of the form $\forall \bar{x}.\ A \to \alpha$ => $e$ is
translated to the definition:

(: $f$ (All ($\bar{x}$) ($\bar{B}$ -> $A$)))
(define $f$ (lambda ($\mathit{supp}(\alpha)$) $e$))

A rule-macro of the form $\forall \bar{x}.\ A \to \alpha$ = $e$ is translated to the
following:

(define-syntax $m$
  (syntax-rules ()
    (($m$ $\bar{x}$ $\mathit{supp}(\alpha)$) $e$)))

The type parameters $\bar{x}$ are passed as arguments to macros so they
can be used in Typed Racket forms. For example, the rule for
let expressions in Figure 7 translates to a typed-let expression in
Racket using the parameter T1.
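To make the two templates concrete, the Python sketch below fills them in as text. It follows the templates literally, including an empty `All` list when a rule has no parameters; whether the real system treats that case specially is not stated. The generated names `f1` and `m1`, the function names, and the pair encoding of labeled symbols are hypothetical.

```python
from typing import List, Sequence, Tuple, Union

Symbol = Union[str, Tuple[str, str]]  # "tok" or (label, nonterminal)


def supp(alpha: Sequence[Symbol]) -> List[str]:
    return [s[0] for s in alpha if isinstance(s, tuple)]


def binders(alpha: Sequence[Symbol]) -> List[Tuple[str, str]]:
    return [(s[0], s[1]) for s in alpha if isinstance(s, tuple)]


def translate_rule_function(name: str, type_params: Sequence[str], lhs: str,
                            rhs: Sequence[Symbol], body: str) -> str:
    """Fill in the rule-function template as a string of Typed Racket text."""
    arg_types = " ".join(b for _, b in binders(rhs))
    args = " ".join(supp(rhs))
    return (f"(: {name} (All ({' '.join(type_params)}) ({arg_types} -> {lhs})))\n"
            f"(define {name} (lambda ({args}) {body}))")


def translate_rule_macro(name: str, type_params: Sequence[str],
                         rhs: Sequence[Symbol], body: str) -> str:
    """Fill in the rule-macro template; type parameters come first."""
    pattern = " ".join([name, *type_params, *supp(rhs)])
    return (f"(define-syntax {name}\n"
            f"  (syntax-rules ()\n"
            f"    (({pattern}) {body})))")


abs_def = translate_rule_function(
    "f1", [], "Integer", ['"|"', ("i", "Integer"), '"|"'], "(abs i)")
if_def = translate_rule_macro(
    "m1", ["T"],
    ['"if"', ("t", "Bool"), '"then"', ("e1", "T"), '"else"', ("e2", "T")],
    "(if t e1 e2)")
```

Note how the macro pattern for the conditional rule begins with the type parameter `T`, followed by the labels `t`, `e1`, and `e2`.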

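Next we show the translation of parse trees to Typed Racket,
written $[\![ \cdot ]\!]$. The key idea is that we translate a parse tree for a
rule-function pair into a function application, and a parse tree of a
rule-macro pair into a macro application,

\[
\begin{aligned}
[\![ A \to^{f} t_\alpha ]\!] &= (f\ [\![ t_{\alpha'} ]\!]) \\
[\![ \forall \bar{x}.\ A \to^{m,\sigma} t_\alpha ]\!] &= (m\ \sigma(\bar{x})\ [\![ t_{\alpha'} ]\!])
\end{aligned}
\]

where in each case $\alpha' = \mathit{binders}(\alpha)$. Terminals simply translate
as themselves, $[\![ a ]\!] = a$.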
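The translation of parse trees can be sketched as a short recursive function in Python. The tree encoding is an assumption made for this example: a node is a dictionary with a `kind` (`"function"` or `"macro"`), the generated `name`, the instantiated `type_args`, and a list of `(label, subtree)` children, where unlabeled children have label `None`. Only labeled children become arguments, which reflects the restriction to the binders. Leaves are token strings, and all generated names are hypothetical.

```python
from typing import Any, Dict, Union

Tree = Union[str, Dict[str, Any]]


def translate(tree: Tree) -> str:
    """Translate a parse tree into Typed Racket application text."""
    if isinstance(tree, str):
        return tree  # tokens translate as themselves
    args = [translate(sub) for label, sub in tree["children"] if label is not None]
    if tree["kind"] == "macro":
        args = list(tree["type_args"]) + args  # type arguments come first
    return "(" + " ".join([tree["name"], *args]) + ")"


def fn(name: str, *children: Any) -> Dict[str, Any]:
    return {"kind": "function", "name": name, "type_args": [], "children": list(children)}


times = fn("f_times", ("x", "n"), (None, "*"), ("y", "5"))
plus = fn("f_plus", ("x", "2"), (None, "+"), ("y", times))
less = fn("f_lt", ("x", "n"), (None, "<"), ("y", "3"))
cond = {
    "kind": "macro", "name": "m_if", "type_args": ["Int"],
    "children": [(None, "if"), ("t", less), (None, "then"),
                 ("e1", "6"), (None, "else"), ("e2", plus)],
}
```

Under this encoding, `translate(cond)` yields `(m_if Int (f_lt n 3) 6 (f_plus 2 (f_times n 5)))`.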

5.5 Examples 

Here we present examples in which we add syntax for two DSLs to 
the host language Typed Racket. 

5.5.1 Giving ML-like Syntax to Typed Racket 

The module in Figure 7 gives ML-like syntax to several operators
and forms of the Typed Racket language. The grammar rules for
Int and Id use regular expressions (in Racket syntax) on the
right-hand side of the grammar rule.

We then use this module in the following program and save it to
the file let.es:

import ML;

let n = 7 {
  if n < 3 then print 6;
  else print 2 + n * 5 + 5;
}

Then we compile it and run the generated let.rkt by entering

$ esc let.es
$ racket -I typed/racket -t let.rkt -m
42

where esc is our extensible syntax compiler. The result, of course, 
is 42. 
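The result can be checked by hand. With `+` at level 1 and `*` at level 2, both left-associative, the expression `2 + n * 5 + 5` groups as `((2 + (n * 5)) + 5)`. The short Python sketch below mirrors that nesting with ordinary functions; it stands in for the generated Typed Racket code, which is not shown in the text, and the function names are hypothetical.

```python
def plus(x: int, y: int) -> int:
    return x + y


def times(x: int, y: int) -> int:
    return x * y


def program(n: int) -> int:
    """The body of the let expression, returning what would be printed."""
    if n < 3:
        return 6
    # ((2 + (n * 5)) + 5), as dictated by the precedence annotations
    return plus(plus(2, times(n, 5)), 5)


result = program(7)
```

For `n = 7` the condition `n < 3` is false, so the else branch computes 2 + 35 + 5.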

5.5.2 A DSL for Sets 

The module below defines syntax for converting lists to sets, com- 
puting the union and intersection of two sets, and the cardinality of 
a set. Each rule-macro expands to a Racket library call. 

module Sets {
  Int = Integer;
  Set = (Setof Int);
  List = (Listof Int);

  Set ::= "{" xs:List "}" = (list->set xs);
  List ::= x:Int = (list x);
  List ::= x:Int "," xs:List = (cons x xs);
  Set ::= s1:Set "|" s2:Set [left,1] =
    (set-union s1 s2);
  Set ::= s1:Set "&" s2:Set [left,2] =
    (set-intersect s1 s2);
  Int ::= "|" s:Set "|" = (set-count s);
}

module ML {
  // type aliases
  Int = Integer;
  Bool = Boolean;

  // functions
  Int ::= x:Int "+" y:Int [left,1] => (+ x y);
  Int ::= x:Int "*" y:Int [left,2] => (* x y);
  Bool ::= x:Int "<" y:Int => (< x y);
  forall T.
  Void ::= "print" x:T ";" => (displayln x);

  // macros
  forall T.
  T ::= "if" t:Bool "then" e1:T "else" e2:T =
    (if t e1 e2);
  forall T1 T2.
  T2 ::= "let" x:Id "=" y:T1 "{" x:T1; z:T2 "}" =
    (let: ([x : T1 y]) z);
  forall T1 T2.
  T2 ::= e1:T1 e2:T2 [left] =
    (begin e1 e2);

  // tokens
  Int ::= #rx"^[0-9]+$";
  Id ::= #rx"^[a-zA-Z][a-zA-Z0-9]*$";
}

Figure 7. An example of giving ML-like syntax to Typed Racket.

After importing this DSL, programmers can use the set syntax 
directly in Typed Racket. We can also combine the Sets module 
with the ML module from before, for example: 

import ML, Sets;
let A = {1, 2, 3} {
  let B = {2, 3, 4} {
    let C = {3, 4, 5} {
      print |A & C|;
      print A | B & C;
    }
  }
}

Saving this program in sets.es, we can then compile and run it:

$ esc sets.es
$ racket -I typed/racket -t sets.rkt -m
1
#<set: 1 2 3 4>
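The two printed values can be cross-checked with Python's built-in sets, whose `&` operator also binds more tightly than `|`, just as intersection (level 2) binds more tightly than union (level 1) in the Sets module. This is only a sanity check of the arithmetic, not a rendering of the generated Racket code.

```python
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}

cardinality = len(A & C)      # |A & C|
union = A | (B & C)           # A | B & C with & binding tighter
other_grouping = (A | B) & C  # what the opposite precedence would give
```

Here `cardinality` is 1 and `union` is `{1, 2, 3, 4}`, in agreement with the output above, whereas the opposite grouping would give `{3, 4}`.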

6. Related Work 

There have been numerous approaches to extensible syntax for pro- 
gramming languages. In this section, we summarize the approaches 
and discuss how they relate to our work. We organize this discus- 
sion in a roughly chronological order. 

In the Lithe language, Sandberg [1982] merges the notion of
grammar rule and macro definition and integrates parsing and type
checking. Unfortunately, he does not describe his parsing algorithm.
Aasa et al. [1988] augment the ML language with extensible
syntax for dealing with algebraic data types. They develop a
generalization of the Earley algorithm that performs Hindley-Milner type
inference during parsing. However, Pettersson and Fritzson [1992]
report that the algorithm was too inefficient to be practically usable.
Pettersson and Fritzson [1992] build a more efficient system based
on LR(1) parsing. Of course, LR(1) parsing is not suitable for our
purposes because LR(1) is not closed under union, which we need
to compose DSLs. Several later works also integrate type inference
into the Earley algorithm [Missura 1997; Wieland 2009]. It may
be possible to adapt these ideas to enable our approach to handle
languages with type inference.

Cardelli et al. [1994] develop a system with extensible syntax
and lexical scoping. That is, their system supports syntax extensions
that introduce variable binders. Their work inspires our treatment
of variable binders in Section 5.3. Cardelli et al. [1994] base
their algorithm on LL(1) parsing, which is also not closed under
union. Also, their system differs from ours in that parsing and type
checking are separate phases. The OCaml language comes with a
preprocessor, Camlp4, that provides extensible syntax
[de Rauglaudre 2002]. The parsing algorithm in Camlp4 is "something
close to LL(1)".

Goguen et al. [1992] provide extensible syntax in the OBJ3 language
in the form of mixfix operators. In OBJ3, types (or sorts)
play some role in disambiguation, but their papers do not describe
the parsing algorithm. There is more literature regarding
Maude [Clavel et al. 1999], one of the descendants of OBJ3. Maude
uses the SCP algorithm of Quesada [1998], which is bottom-up and
bidirectional, much like our island parser. However, we have not
been able to find a paper that describes how types are used for
disambiguation in the Maude parser.

The Isabelle Proof Assistant [Paulson 1994] provides support
for mixfix syntax definitions. The algorithm is a variant of chart 
parsing and can parse arbitrary CFGs, including ambiguous ones. 
When there is ambiguity, a parse forest is generated and then a 
later type checking pass (based on Hindley-Milner type inference) 
prunes out the ill-typed trees. 

Ranta [2004] develops the Grammatical Framework, which integrates
context-free grammars with a logical framework based on
type theory, that is, a rich type system with dependent types. His 
framework handles grammar rules with variable binders by use of 
higher-order abstract syntax. The implementation uses the Earley 
algorithm and type checks after parsing, similar to Isabelle. 

Several systems use Ford's Parsing Expression Grammar (PEG) 
formalism [Ford 2004]. PEGs are stylistically similar to CFGs;
however, PEGs avoid ambiguity by introducing a prioritized choice 
operator for rule alternatives and PEGs disallow left-recursive 
rules. We claim that these two restrictions are not appropriate for 
composing DSLs. The order in which DSLs are imported should 
not matter and DSL authors should be free to use left recursion if 
that is the most natural way to express their grammar. 
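The sensitivity of prioritized choice to ordering can be seen in a very small Python sketch. It models only literal alternatives and a single choice point, with no attempt to implement a full PEG parser. The two DSL fragments and their rule names (`A.less`, `B.arrow`) are hypothetical.

```python
from typing import List, Optional, Tuple

Alternative = Tuple[str, str]  # (rule name, literal to match)


def ordered_choice(alternatives: List[Alternative], text: str) -> Optional[Alternative]:
    """PEG-style prioritized choice: the first alternative that matches wins."""
    for name, literal in alternatives:
        if text.startswith(literal):
            return name, literal
    return None


def parse_whole(alternatives: List[Alternative], text: str) -> Optional[str]:
    """Succeed only if the chosen alternative consumes the entire input.

    Once an alternative has matched, later alternatives are not retried,
    which is how prioritized choice behaves.
    """
    hit = ordered_choice(alternatives, text)
    if hit is None:
        return None
    name, literal = hit
    return name if len(literal) == len(text) else None


dsl_a = ("A.less", "<")    # contributed by one hypothetical DSL
dsl_b = ("B.arrow", "<-")  # contributed by another

a_then_b = parse_whole([dsl_a, dsl_b], "<-")
b_then_a = parse_whole([dsl_b, dsl_a], "<-")
```

Importing the first DSL before the second makes the input `<-` fail to parse (`a_then_b` is `None`), while the opposite order succeeds (`b_then_a` is `"B.arrow"`).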

Danielsson and Norell [2008] investigate support for mixfix
operators for Agda using parser combinators with memoization, 
which is roughly equivalent to the Earley algorithm. Their algo- 
rithm does not use type-based disambiguation during parsing, but 
they note that a type-checking post-processor could be used to filter 
parse trees, as is done in Isabelle. 

The MetaBorg [Bravenboer et al. 2005] system provides extensible
syntax in support of embedding DSLs in general-purpose languages.
MetaBorg is built on the Stratego/XT toolset, which in turn
uses the syntax definition framework SDF [Heering et al. 1989],
which uses scannerless GLR to parse arbitrary CFGs. Like Isabelle,
the MetaBorg system performs type-based disambiguation to prune
ill-typed parse trees from the resulting parse forest. Our treatment
of precedence and associativity is based on their notion of
disambiguation filter [van den Brand et al. 2002]. We plan to explore the
scannerless approach in the future. Bravenboer and Visser [2009]
look into the problem of composing DSLs and investigate methods
for composing parse tables. We currently do not create parse
tables, but we may use these ideas in the future to further optimize
the efficiency of our algorithm.

Jim et al. [2010] develop a grammar formalism and parsing
algorithm to handle data-dependent grammars. One of the contributions
of their work is the ability to bind parsing results to variables that
can then be used to control parsing. We use this idea in Section 5.3
to enable grammar rules with variable binding. Their algorithm is a
variation of the Earley algorithm and does not perform type-based
disambiguation, but it does provide attribute-directed parsing.

7. Conclusions 

In this paper we presented a new parsing algorithm, type-oriented 
island parsing, that is the first parsing algorithm to be constant time 
with respect to the size of the grammar under the assumption that 
the grammar is sparse. (Most parsing algorithms are linear with 
respect to the size of the grammar.) Our motivation for developing 
this algorithm comes from the desire to compose domain-specific
languages, that is, to simultaneously import many DSLs into a 
software application. 

We present an extensible parsing system that provides a front- 
end to a host language, such as Typed Racket, enabling the defini- 
tion of macros and functions together with grammar rules that pro- 
vide syntactic sugar. Our parsing system provides precedence and 
associativity annotations, parameterized grammar rules, and gram- 
mar rules with variable binders and scope. 

In the future we plan to extend the syntax of nonterminals to
represent structural types, formally prove the correctness of
type-oriented island parsing, pursue further opportunities to improve the
performance of the algorithm, and provide diagnostics for helping
programmers resolve the remaining ambiguities that are not
addressed by type-based disambiguation.